Relevance Propagation Model for Large Hypertext Documents Collections

Authors

  • Idir Chibane
  • Bich-Liên Doan
Abstract

Web search engines have become indispensable in our daily lives, helping us find the information we need. Several search tools, for instance Google, use links to select the documents matching a query. In this paper, we propose a new ranking function that combines content and link scores based on the propagation of scores over links. This function propagates scores from source pages to destination pages in relation to the query terms. We evaluated our ranking function with experiments over two test collections, WT10g and GOV. We conclude that propagating link scores according to query terms provides a significant improvement for information retrieval.

Introduction

A major focus of information retrieval (IR) research is developing strategies for identifying documents that are “relevant” to a given query. In traditional IR, the evidence of relevance is assumed to reside within the textual content of documents. Consequently, the fundamental strategy of traditional IR is to rank documents according to their estimated degree of relevance, based on measures such as term similarity or term occurrence probability. In the Web setting, however, information can also reside outside the textual content of documents. For example, links between pages can be used to improve the term-based estimation of document relevance. Hyperlinks, being one of the most important sources of evidence in Web documents, have been the subject of much research exploring retrieval strategies based on link analysis.

The explosive growth of the Web has led to a surge of research activity in the area of IR on the World Wide Web. Ranking has always been an important component of any information retrieval system (IRS); in the case of Web search, its importance becomes critical. Given the size of the Web (Google counted more than 8.16 billion Web pages in August 2005), it is imperative to have a ranking function that captures the user's needs. To this end, the Web offers a rich context of information expressed via links. In recent years, several information retrieval methods exploiting the link structure have been developed. Most systems based on link structure information combine content with a popularity measure of the page to rank query results. Google's PageRank (Brin et al., 1998) and Kleinberg's HITS (Kleinberg, 1999) are the two fundamental algorithms employing the link structure of the Web. A number of extensions of these two algorithms have also been proposed, such as (Lempel et al., 2000), (Haveliwala, 2002), (Kamvar et al., 2003), (Jeh et al., 2003), (Deng et al., 2004) and (Xue-Mei et al., 2004). All these link analysis algorithms rest on two assumptions: (1) if there is a link from page A to page B, we may assume that page A endorses and recommends the content of page B; (2) pages that are co-cited by a given page are likely to share the same topic, which can also help retrieval. The power of hyperlink analysis comes from the fact that it uses the content of other pages to rank the current page. Ideally, these pages were created by authors independent of the author of the original page, thus adding an unbiased factor to the ranking. The study of existing systems enabled us to conclude that most ranking functions using the link structure do not depend on the query terms; as a consequence, the precision of the retrieved results may decrease significantly.
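As an illustration of the proposed idea (a minimal sketch under assumed details, not the paper's actual ranking function, which is defined later), the fragment below gives every page a content score for the query and then lets each page receive a fraction of the scores of the pages linking to it, so that link evidence only counts when the linking pages themselves match the query terms. The toy content-scoring function, the damping factor alpha and the fixed number of iterations are illustrative assumptions.

```python
from collections import defaultdict

def content_score(page_text, query_terms):
    """Toy content score: the fraction of query terms occurring in the page."""
    words = set(page_text.lower().split())
    return sum(term.lower() in words for term in query_terms) / len(query_terms)

def propagate_scores(pages, links, query_terms, alpha=0.5, iterations=3):
    """Sketch of query-dependent score propagation over hyperlinks.

    pages: dict mapping page_id -> page text
    links: list of (source_id, destination_id) hyperlinks between those pages
    Each page keeps its own content score and additionally receives a
    fraction alpha of its in-neighbours' scores, split over their out-links.
    """
    score = {p: content_score(text, query_terms) for p, text in pages.items()}
    out_degree = defaultdict(int)
    for src, _ in links:
        out_degree[src] += 1

    for _ in range(iterations):
        received = defaultdict(float)
        for src, dst in links:
            # Evidence flows only from pages that themselves match the query,
            # since their score is query-dependent.
            received[dst] += score[src] / out_degree[src]
        score = {p: content_score(text, query_terms) + alpha * received[p]
                 for p, text in pages.items()}
    return score
```

Ranking pages by the resulting scores favours pages that are both textually relevant and pointed to by other relevant pages, which is the behaviour the abstract argues improves retrieval.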
In this paper we investigate, theoretically and experimentally, the application of link analysis to rank pages on the Web. The rest of this paper is organized as follows. In Section 2, we review recent work on link analysis ranking algorithms and present some extensions of these algorithms. We then present our information retrieval model with the new ranking function. In Section 4, we show experimental results on multiple queries using the proposed algorithm, including a comparative study of different algorithms. In Section 5, we summarize our main contributions and discuss possible new applications of our proposed method.

Previous Work

Unlike traditional IR, the Web contains both content and link structures, which provide many new dimensions for exploring better IR techniques. In the early days, Web content and structure were analyzed independently. Typical approaches such as (Hawking, 2000), (Craswell et al., 2003) and (Craswell et al., 2004) use the TF-IDF (Salton et al., 1975) of the query terms in a page to compute a relevance score, and use hyperlinks to compute a query-independent importance score (e.g. PageRank (Brin et al., 1998)); these two scores are then combined to rank the retrieved documents. In recent years, new methodologies that explore the inter-relationship between content and link structures have been introduced. (Qin et al., 2005) divides these methods into two categories: the first enhances link analysis with the help of content information (Kleinberg, 1999) (Lempel et al., 2000) (Haveliwala, 2002) (Amento et al., 2000) (Chakrabarti, 2001) (Chakrabarti et al., 2001) (Ingongngam et al., 2003); the second is relevance propagation, which propagates content information with the help of the Web structure (Mcbryan, 1994) (Song et al., 2004) (Shakery et al., 2003).

HITS (Kleinberg, 1999) is representative of the first category. The HITS algorithm first builds a query-specific sub-graph and then computes authority and hub scores on this sub-graph to rank the documents. Kleinberg distinguishes between two notions of relevance: an authority is a page that is relevant in itself, while a hub is a page that is relevant because it contains links to many related authorities. To identify good hubs and authorities, Kleinberg's procedure exploits the graph structure of the Web. Given a query, the procedure first constructs a focused sub-graph G and then computes hub and authority scores for each node of G. To quantify the quality of a page as a hub and as an authority, Kleinberg associates every page with a hub weight and an authority weight. Following the mutually reinforcing relationship between hubs and authorities, Kleinberg defines the hub weight as the sum of the authority weights of the nodes pointed to by the hub, and the authority weight as the sum of the hub weights of the nodes that point to this authority. Let A denote the n-dimensional vector of authority weights, where A_i is the authority weight of page p_i, and let H denote the n-dimensional vector of hub weights, where H_i is the hub weight of page p_i. The computation of the authority and hub weights is given by the following formula:
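The formula itself is cut off in this excerpt. Written out from the verbal definition above (a reconstruction, not copied from the paper), the standard HITS updates over the edge set E of the focused sub-graph G are:

```latex
A_i = \sum_{j \,:\, (j,i) \in E} H_j
\qquad\qquad
H_i = \sum_{j \,:\, (i,j) \in E} A_j
```

Both vectors are typically normalized after each update and the two equations are iterated until the weights converge; in matrix form, with L the adjacency matrix of G, the updates read A = L^T H and H = L A, so A converges to the principal eigenvector of L^T L and H to that of L L^T.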

Similar works

Relevance Feedback Using Weight Propagation Compared with Information-Theoretic Query Expansion

A new Relevance Feedback (RF) technique called Weight Propagation has been developed which provides greater retrieval effectiveness and computational efficiency than previously described techniques. Documents judged relevant by the user propagate positive weights to documents close by in vector similarity space, while documents judged not relevant propagate negative weights to such neighbouring...
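As an illustration of the propagation idea described above (a minimal sketch assuming a term-vector representation and cosine similarity; the weights and function names are placeholders, not those of the paper):

```python
import numpy as np

def weight_propagation(doc_vectors, relevant_ids, nonrelevant_ids,
                       pos_weight=1.0, neg_weight=-0.5):
    """Propagate feedback weights to neighbours in vector-similarity space.

    doc_vectors: (n_docs, n_terms) matrix of document term vectors.
    Relevant documents push positive weight, and non-relevant documents
    push negative weight, onto every document in proportion to cosine
    similarity, yielding a score used to re-rank the collection.
    """
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                      # pairwise cosine similarities
    scores = np.zeros(len(doc_vectors))
    for i in relevant_ids:
        scores += pos_weight * sim[i]
    for i in nonrelevant_ids:
        scores += neg_weight * sim[i]
    return scores
```

Documents near judged-relevant ones in similarity space are thus promoted, while those near judged-non-relevant ones are demoted.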

Learning Topic Hierarchies and Thematic Annotations from Document Collections

Large textual and multimedia databases are now widely available, but their exploitation is restricted by the lack of meta-information about their structure and semantics. Many such collections, like those gathered by most search engines, are loosely structured. Some have been manually structured, at the cost of considerable effort. This is the case for hierarchies like those of internet portals (...

Information Access via Topic Hierarchies and Thematic Annotations from Document Collections

With the development and availability of large textual corpora, there is a need to enrich and organize these corpora so as to make search and navigation among the documents easier. Semantic Web research focuses on augmenting ordinary Web pages with semantics. Indeed, while a wealth of information exists today in electronic form, it cannot be easily processed by computers due to la...

Focused Crawling: A New Approach to Topic-Specific Resource Discovery

The rapid growth of the world-wide web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext information management system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exem...

Relevance Feedback With Too Much

Modern text collections often contain large documents which span several subject areas. Such documents are problematic for relevance feedback since inappropriate terms can easily be chosen. This study explores the highly effective approach of feeding back passages of large documents. A less expensive method which discards long documents is also reviewed and found to be effective if there are enou...

Link as You Type: Using Key Phrases for Automated Dynamic Link Generation

When documents are collected together from diverse sources they are unlikely to contain useful hypertext links to support browsing amongst them. For large collections of thousands of documents it is prohibitively resource intensive to manually insert links into each document. Users of such collections may wish to relate documents within them to text that they are themselves generating. This pro...

Publication date: 2007